ENH: Add future.python_scalars #63016
base: main
Conversation
Plan to run a full set of ASVs next week; some microbenchmarks:

```python
from pandas.core.dtypes.cast import maybe_unbox_numpy_scalar

with pd.option_context("python_scalars", True):
    %timeit maybe_unbox_numpy_scalar(np.int64(2))
    # 828 ns ± 9.91 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    %timeit maybe_unbox_numpy_scalar(2)
    # 161 ns ± 0.414 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

ser = pd.Series([1, 2, 3] * 10_000)
with pd.option_context("python_scalars", True):
    %timeit ser.sum()
    # 9.42 μs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
with pd.option_context("python_scalars", False):
    %timeit ser.sum()
    # 8.28 μs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
Full ASVs are below, only showing where there was a 10% or more regression. In the full list, only the following two actually hit the function. I was curious why only min/max showed up as regressions in series_methods.NanOps.
@jbrockmendel - you good with the ASVs here?
No complaints here. |
```python
else:
    result = result.reshape(1)
    if using_python_scalars():
        result = np.array([result])
```
Why do this instead of maybe_unbox_numpy_scalar?
The result here prior to L1543 is already a Python scalar when future.python_scalars=True due to calling the reduction function. In this block, keepdims=True so we need to convert it to a NumPy array.
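A minimal sketch of that wrapping step, using a plain Python int to stand in for the reduction result (names are illustrative, not the actual pandas internals):

```python
import numpy as np

# Assume the reduction already returned a Python scalar because
# future.python_scalars is enabled (here we just use a plain int).
result = 6

# keepdims=True semantics require a shape-(1,) array, so wrap explicitly:
result = np.array([result])
assert result.shape == (1,)
```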
```python
if isinstance(result, np.longdouble):
    result = float(result)
else:
    result = value.item()
```
I know this will mess up on a timedelta64:

```python
obj = np.timedelta64(1, "ns")
assert isinstance(obj, np.generic)
obj.item()  # 1 - a plain int (the nanosecond count), not a timedelta
```

I don't know if there are other cases where obj.item() messes up, but I'm wary of it. Heads up.
Thanks - will add a test for all dtypes. Here is the full list of scalars and their corresponding item() type, excluding datetime/timedelta. The only other problematic one is complex256.
```text
?  bool        <class 'bool'>
b  int8        <class 'int'>
h  int16       <class 'int'>
i  int32       <class 'int'>
l  int64       <class 'int'>
q  int64       <class 'int'>
n  int64       <class 'int'>
p  int64       <class 'int'>
B  uint8       <class 'int'>
H  uint16      <class 'int'>
I  uint32      <class 'int'>
L  uint64      <class 'int'>
Q  uint64      <class 'int'>
N  uint64      <class 'int'>
P  uint64      <class 'int'>
e  float16     <class 'float'>
f  float32     <class 'float'>
d  float64     <class 'float'>
g  float128    <class 'numpy.longdouble'>
F  complex64   <class 'complex'>
D  complex128  <class 'complex'>
G  complex256  <class 'numpy.clongdouble'>
S  |S1         <class 'bytes'>
U  <U1         <class 'str'>
V  |V0         <class 'bytes'>
```
For datetime/timedelta, the arrow dtypes already return pd.Timedelta and pd.Timestamp. I think we can align the NumPy dtypes to do the same.
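The table above can be reproduced with a small loop over the type characters (a sketch, restricted to the bool/numeric/complex codes and not the exact script used here; exact dtype names for `g`/`G` vary by platform):

```python
import numpy as np

# For each type character, show the dtype name and the Python type
# that .item() produces on the corresponding NumPy scalar.
for char in "?bhilqBHILQefdgFDG":
    dt = np.dtype(char)
    scalar = dt.type(0)
    print(char, dt.name, type(scalar.item()))
```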
```python
result = getattr(df.C, op)()
assert isinstance(result, np.float64)
if using_python_scalars:
    assert isinstance(result, float)
```
I think np.float64 subclasses float, so this won't exclude float64.
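That subclass relationship is easy to verify; a sketch of a check that does distinguish the two (the helper name is made up for illustration):

```python
import numpy as np

x = np.float64(1.5)
assert isinstance(x, float)  # passes: np.float64 subclasses Python float


def is_plain_float(obj):
    # Exclude NumPy scalar types to assert a *plain* Python float.
    return isinstance(obj, float) and not isinstance(obj, np.floating)


assert not is_plain_float(np.float64(1.5))
assert is_plain_float(1.5)
```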
```python
if using_python_scalars:
    assert isinstance(result, int)
else:
    assert result.dtype == "uint64"
```
just check type rather than dtype?
```python
assert 0 == s.skew()
assert isinstance(s.skew(), np.float64)  # GH53482
if using_python_scalars:
    assert isinstance(s.skew(), float)
```
won't exclude float64
pandas/tests/series/test_ufunc.py (Outdated)
```python
    tm.assert_series_equal(result, expected)
else:
    expected = values[1]
    if using_python_scalars and values.dtype.kind in ["i", "f"]:
```
NBD but checking for kind in "if" is very slightly faster than checking for kind in ["i", "f"].
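Both spellings are equivalent for single-character kind codes, since `in` on a string does substring membership:

```python
import numpy as np

kind = np.dtype("int64").kind  # "i"
assert kind in "if"            # substring membership (slightly faster)
assert kind in ["i", "f"]      # list membership, same result here
assert np.dtype("float32").kind in "if"
assert np.dtype("object").kind not in "if"
```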
```python
if isinstance(all_data, Series):
    assert not (std_x < 0).any()
else:
    assert not (std_x < 0).any().any()
```
I guess np.bool_(True).any() returns itself? That's kind of convenient. Could go in the list Joris asked for of potential downsides.
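Indeed, NumPy scalars expose the array reduction methods, so chaining a second .any() onto an np.bool_ is a no-op; a quick check:

```python
import numpy as np

flag = (np.array([1, 2]) < 0).any()  # an np.bool_, not a plain bool
assert isinstance(flag, np.bool_)
assert flag.any() == flag            # .any() on the scalar returns itself
assert not flag.any().any()          # so chained reductions keep working
```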
For this particular case, I'd say axis=None should be preferred. But perhaps there are others where that is useful.
```python
if isinstance(all_data, Series):
    assert not (var_x < 0).any()
else:
    assert not (var_x < 0).any().any()
```
If this pattern is going to show up a lot, could we make a helper in pd._testing?
Changed to axis=None, no branching.
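For reference, axis=None makes .any() reduce over all axes at once, so the same line works for both Series and DataFrame without isinstance branching (a sketch with toy data):

```python
import pandas as pd

ser = pd.Series([1.0, 2.0])
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# One spelling for both container types:
assert not (ser < 0).any(axis=None)
assert not (df < 0).any(axis=None)
```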
Adds an experimental option to return Python scalars instead of NumPy scalars across the API. This is not yet fully implemented everywhere, e.g. Series.__getitem__, but I'm hoping reductions are a substantial chunk.

This is complicated by #62988, where it was found that many of our doctests are not running. We run those doctests using NumPy>=2, and if we were to get those doctests to pass as-is, we would need to change the NumPy reprs from e.g. 2 to np.int64(2). If we then change reductions et al. to return Python scalars, we'd then change all the reprs back from e.g. np.int64(2) to 2. So instead I think we can switch future.python_scalars to True in 4.0, then deprecate the future option.
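The repr difference in question: under NumPy >= 2, scalar reprs are type-qualified, which is what breaks doctests expecting plain values, while Python scalars repr the same everywhere (quick illustration):

```python
import numpy as np

# NumPy 1.x:  repr(np.int64(2)) == "2"
# NumPy 2.x:  repr(np.int64(2)) == "np.int64(2)"
print(repr(np.int64(2)))

# The unboxed Python scalar reprs identically on both versions:
print(repr(np.int64(2).item()))  # 2
```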